Skip to content

[WIP] Issue 7 : pbs cluster init, class inheritance refactoring#10

Merged
jhamman merged 10 commits into
dask:masterfrom
guillaumeeb:7_pbs_cluster_init
Mar 27, 2018
Merged

[WIP] Issue 7 : pbs cluster init, class inheritance refactoring#10
jhamman merged 10 commits into
dask:masterfrom
guillaumeeb:7_pbs_cluster_init

Conversation

@guillaumeeb

Copy link
Copy Markdown
Member

So looking for comment here as this is my first pull request ever.

I've made some significant changes from the master branch:

  • As issue Not setting project should not raise  #7 first mentioned, I've remove the check on project being defined or not,
  • I've made the PBS header construction, but also the dask-worker command construction more flexible, as suggested by @jhamman,
  • I've moved some logic from PBS/SLURM classes to the JobQueueCluster parent class. It follows an abstract class pattern, but I've decided not to use ABC Metaclass for the sake of simplicity and python 2.7 easier compatibility as discussed with @mrocklin. So I've juste defined some abstract method by raising NotImplementedError in the JobQueueCluster class.

A question too on SLURMCluster. I've modified its implementation, but I've left the fixed header template logic here, as I am not a SLURM user. Anyway I still have some questioning about the CPU and memory resources in this class : we use only the processes (so nprocs) of dask here, but I would expect to see something like processes * nthreads; moreover, I don't see any memory limit in the template.

Finaly, I've not yet tested this in real condition, only some basic play with job script construction into a python terminal (and by the way I believe we should add some basic unit tests like that, don't you think?).

@mrocklin mrocklin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking for comment here as this is my first pull request ever.

Welcome! I hope that it is a positive experience for you. I believe that this work will have some impact, so I hope that you stay involved.

I've added a few quick comments below. They are mostly stylistic rather than substantial. I hope to have more time to try this out on a PBS system soon, although I may not have time until one week from today (going on vacation). Hopefully the others involved in this project will also be able to raise some thoughts here.

Comment thread dask_jobqueue/pbs.py Outdated
"""

_submitcmd = 'qsub'
_cancelcmd = 'qdel'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to rename these to the more explicit (if perhaps also longer) submit_command and cancel_command and remove the properties that point to them.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll try this, comming from Java background, it was more natural to me to override kind of methods instead of attributes :).

Comment thread dask_jobqueue/pbs.py Outdated
if queue != None: header_lines.append('#PBS -q %s' % queue)
if project != None: header_lines.append('#PBS -A %s' % project)
if resource_spec != None: header_lines.append('#PBS -l %s' % resource_spec)
if walltime != None: header_lines.append('#PBS -l walltime=%s' % walltime)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most other dask related projects use the flake8 style checker, which would probably raise some warnings here for combining the if block into a single line. You might want to run the following from the root of this directory:

flake8 dask_jobqueue

But we should probably add this to CI so that we can automate this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me. Perhaps we could enable Travis CI to do some basic tests and style checking, without having all the docker testing env yet? Should we raise an issue separated from #2 ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done this in #13. You can also run the same checker locally by running: flake8 -j auto dask_jobqueue

Comment thread dask_jobqueue/core.py Outdated
:param death_timeout:
:param local_directory:
:param extra:
:param kwargs:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far we've been using numpydoc style docstrings: http://dask.pydata.org/en/latest/develop.html#docstrings

Comment thread dask_jobqueue/core.py Outdated
Abstract attribute for job scheduler cancel command
:return:
"""
raise NotImplementedError

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of removing these properties and just going with class attributes if possible

Comment thread dask_jobqueue/pbs.py
job_extra : list
List of other PBS options, for example -j oe. Passed with '#PBS ' prefix
local_directory : str
Dask worker local directory for file spilling.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove threads_per_worker above? Also did you want to logically separate the dask options from the PBS options?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, scrolling down I see that you've done just that. If so then local_directory should probably be an inherited parameter. I think it should be generic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup I separated options, and for more consistancy with other keywords I used threads instead of thread_per_worker.
local_directory is generic, but I wanted to override it in PBSCluster, because standard PBS installation always provide a $TMPDIR variable which resolve to something like /tmp/{JOB_ID}/. I wanted to make this as the default local_directory for PBSCluster workers.

@guillaumeeb

Copy link
Copy Markdown
Member Author

So I've pushed another commit with @mrocklin comments taken care of. Tell me how you feel about it.

@guillaumeeb

Copy link
Copy Markdown
Member Author

Oh, And I've tested on our cluster, and it worked like a charm (even if I did not push the really far).

@jhamman jhamman left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like how this turned out. I've made a number of small comments, mostly targeting how users will interpret different options.

Comment thread dask_jobqueue/core.py Outdated
self._adaptive = None

#dask-worker command line build
self._command_template = "%s/dask-worker %s" % (dirname, self.scheduler.address)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly safer to write this as:

self._command_template = os.path.join(dirname, 'dask-worker' % self.scheduler.address)

Comment thread dask_jobqueue/core.py Outdated
-------
A string containing multiple lines to be used as header of job file to be sumitted.
"""
raise NotImplementedError

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just make this a class attribute that defaults to an empty string.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I tried to keep one of my "contract" of Abstract Class, but I guess it is still too Java oriented :).

Comment thread dask_jobqueue/pbs.py Outdated
--------
>>> from dask_jobqueue import PBSCluster
>>> cluster = PBSCluster(project='...')
>>> cluster = PBSCluster(queue= 'regular', project='DaskOnPBS')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8 (remove space after =)

Comment thread dask_jobqueue/pbs.py Outdated
death_timeout=60,
extra='',
job_extra=[],
local_directory='$TMPDIR',

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slight preference to set this to None by default. Either that or os.getenv('TMPDIR', None).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believed that $TMPDIR was a PBS default, but apparently it should be PBS_TMPDIR. Can you check on the cluster you are using?
Maybe it is more generic to just remove this arg by default.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On our system (NCAR's cheyenne), TMPDIR is set to /var/tmp/{JOB_ID}. So given the divergence, I suggest just removing this as a default.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then our systems are configured with the same variable (if not set to the same directory), it is difficult to generalise from two examples, but TMPDIR seems to be used for a scratch dir for you and me. Stilled removed it maybe I will add a comment for this.

Comment thread dask_jobqueue/pbs.py Outdated
threads_per_worker=4,
processes=9,
memory='7GB',
resource_spec='select=1:ncpus=24:mem=100GB',

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure what to do here. One option is to leave this as None and force (most) users to set this. Keeping it as is may be okay too but everyone will have to change it (except for you 😉). I have a slight preference for setting it to None.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I have the same concern. I found your defaults too high, so I set mines, but it is probably wrong too.

On our cluster if we don't configure it, the default will be select=1:ncpus=1:mem=5GB. So worker jobs will probably be killed promptly if dask worker is set with more than 1 thread or process. And I believe it would be really bad for PBS to launch one job for each 1 process dask-worker.

My preference would be more close to what is mentioned in #12, if nothing is provided, resource_spec is computed using dask-worker options. Still not enthusiast about that? It's a good compromise, keep resource_spec keyword, but build it from dask-worker options if not provided. The only concern if for memory limit, there may be a need to extrapolate from String to integer...

So maybe it is simpler to set it to None :)...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might inspect available memory and cpu cores using tools like psutil and multiprocessing

In [1]: import psutil
In [2]: psutil.virtual_memory().total
Out[2]: 16680607744
In [3]: psutil.cpu_count()
Out[3]: 4

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be okay. The challenge is that you'll routinely have the situation where the machine that scheduler is running on is spec'd differently than the machines where the workers will be running on. I think this may be worth doing as a first guess.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not inclined to base the resource_spec on the submitting node resource...
I would much prefer to guess it from the given threads and processes. Else I would default to None, maybe with a warning?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will tend to be different on every machine - what about incorporating a user config file
that would allow local customization?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b Hmm, could you elaborate?
From my understanding of your suggestion, I would prefer to only have to change a keyword value than having a user config file, but it is just my feeling.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the comment from @mrocklin above, for me this could be put into the JobQueueCluster class, for choosing default processes and/or threads and memory. Maybe we ought to set those to None by default?
Commit incoming for demonstrating the automatic resource_spec assignment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we cannot reach a common view about this, I propose to continue discussing it in another issue, maybe #12 or a new one. Then for this pull request, I can put resource_spec to None by default, remove the solution I proposed with my last commit if you don't like it, and eventually merge all this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are different from one machine to another, but typically don't need to change from one run to another on the same machine, sure a cli should be provided but a config file would allow the user not to have to remember and add the same cli again and again.

Comment thread dask_jobqueue/pbs.py Outdated
walltime : str
Walltime for each worker job.
job_extra : list
List of other PBS options, for example -j oe. Passed with '#PBS ' prefix

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe say: Each option will be prepended with the #PBS prefix.

Comment thread dask_jobqueue/slurm.py Outdated
super(SLURMCluster, self).__init__(name=name, processes=processes, **kwargs)

#TODO has this been tested? This seems weird to use only processes, and not processes * threads
# There are no memory limit given to Slurm either?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the initial implementation left those out but we should provide both. @jedwards4b - do you have a comment here?

Comment thread dask_jobqueue/slurm.py
self._cancelcmd = 'scancel'
# Not used
'memory': memory
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above. I think we'll want to include memory.

Comment thread dask_jobqueue/core.py
local_directory=None,
extra='',
**kwargs
):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add these options to the individual cluster doc strings?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I get your point, where do you want to put them? They are already in the JobQueueCluster doc strings. You wan to put it in every implementation?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was just that the JobQueueCluster doc string doesn't show up on the PBSCluster doc string so it requires some unnecessary searching to find all the possible arguments to the constructor.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was already some duplicated doc, I updated it.
Does not feel really good about that though, don't like to duplicate docstrings... Is there a better way?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could directly take the class docs and add to them. i.e. in each child class definition you do something like:

        doc = JobQueueCluster.__doc__
        self.__doc__ +=  "specific docs for Slurm cluster"

Perhaps there's a less hacky way though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rabernat - do I recall you introducing the xarray devs to a decorator approach to doc string templating? I can't seem to dig it up but perhaps someone knows what I'm talking about.

@jhamman jhamman left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally happy with the state of this now. I think there are further improvements we can make but we should get this out in the wild.

@mrocklin - any objections?

Comment thread dask_jobqueue/slurm.py Outdated
#TODO has this been tested? This seems weird to use only processes, and not processes * threads
# There are no memory limit given to Slurm either?

#Keeping template for now has I don't know much about slurm.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments like these can probably be removed

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And created a new issue #22


def test_basic(loop):
with PBSCluster(walltime='00:02:00', threads_per_worker=2, memory='7GB',
with PBSCluster(walltime='00:02:00', threads=2, memory='7GB',

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if this name is perhaps a little ambiguous now. It might be worth it for the shorter name, I'm not sure.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it would be me, I would name every corresponding attribute with suffix liker "per_worker" or prefix like "worker_". Like mentioned in #7 (comment).
As it stands, I don't see why threads should be treated differently than memory. Or maybe we should change memory to memory_per_worker, as for processes it actually translates in "number of workers"?

@mrocklin

Copy link
Copy Markdown
Member

In general it's fine with me. I'm also happy to defer to others' judgment.

@guillaumeeb

Copy link
Copy Markdown
Member Author

@jhamman I agree with you. There a lot of improvements to be made, but we see on other issues or PR that we should probably push this as a reference to start working on other improvements, as this one defines a basic implementation strategy for specific Scheduler.

@rabernat

rabernat commented Mar 26, 2018 via email

Copy link
Copy Markdown

@guillaumeeb

Copy link
Copy Markdown
Member Author

@jhamman I've included docrep for docstrings depency as suggested by @rabernat.
Anything else?

@jhamman jhamman merged commit 3ea2dad into dask:master Mar 27, 2018
@jhamman

jhamman commented Mar 27, 2018

Copy link
Copy Markdown
Member

Thanks @guillaumeeb - I hope to work on this a bit tomorrow so I'm merging now.

@jhamman jhamman mentioned this pull request Mar 27, 2018
@lesteve

lesteve commented Apr 18, 2018

Copy link
Copy Markdown
Member

@rabernat I did not know about docrep, nice! Just curious have you used docrep in a project already? If that's the case I would be interested in hearing the kind of experience you had with it!

@rabernat

Copy link
Copy Markdown

Docrep is used in this module: https://github.com/xgcm/xgcm/blob/master/xgcm/grid.py

It's useful when many different docstrings have to share very similar components...avoids repetition and thus potential errors.

@lesteve

lesteve commented Apr 18, 2018

Copy link
Copy Markdown
Member

Nice, thanks for the link!

@guillaumeeb guillaumeeb deleted the 7_pbs_cluster_init branch August 27, 2018 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants